Partitioned Narratives: Thick Mapping the 1947 Partition Archive

Introduction

The documentation below is the white paper for the essay “Partitioned Narratives: Thick Mapping the 1947 Partition Archive.” It includes the R code and CSV files necessary to reproduce the calculations. Some of the spatial manipulations were performed in QGIS 3.16 (Hannover). Where possible, images and R code chunks have been provided for reproducibility. Some steps involved converting CSV files to GeoPackage files; as this is a common GIS workflow, it has been skipped.

Part 1: Priming the NER Extracted Data

Load packages

The following packages are necessary to run this script: tidyverse, tidygeocoder, tidytext, stringi, htmlTable, broom, ggplot2, and scales. The broom package supplies the tidy() function used in Part 3.

library(tidyverse)
library(tidygeocoder)
library(tidytext)
library(stringi)
library(htmlTable)
library(broom)
library(ggplot2)
library(scales)

Load location data

The loaded CSV file is a cleaned-up version of the one that results from scraping the narratives and running them through NER. The cleaning process mostly involves removing false positives, consolidating equivalent locations (e.g. Bombay and Mumbai), and removing any corrupt data. This process also included coding the gender of the narrator and whether a person mentioned their occupation. Finally, for lesser-known or ambiguous locations we added the city and district to aid the geocoder.

partition_df <- read_csv("data/post_clean_locations.csv", na = c("", "NA"))

Reformat partition_df

The following procedure primes the data for analysis:

  • Create an address field by uniting the location, city, and country fields
  • Remove unnecessary strings from data fields
  • Drop unnecessary columns
  • Keep all distinct addresses by person name. This prevents double counting locations in a person’s account
partition_distinct_locations <-  partition_df %>%
  group_by(name) %>%
  #create address field from locations, city, and country columns
  unite("address",
        locations:country,
        sep = ", ",
        na.rm = TRUE) %>%
  #remove unnecessary label text from the age and migration fields
  mutate(
    age = str_remove(age, "Age in 1947: "),
    migrated_from = str_remove(migrated_from, "Migrated from: "),
    migrated_to = str_remove(migrated_to, "Migrated to: "),
  ) %>%
  #drop unnecessary columns
  select(name:migrated_to, gender:address) %>%
  #keep distinct addresses within each person so a location is not double counted
  distinct(address, .keep_all = TRUE) %>%
  ungroup()
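
As a minimal sketch of what the unite and distinct steps do, the following runs the same pattern on a small invented table. The names and rows in toy_df are assumptions for illustration only and do not come from the archive.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for partition_df; names and places are invented for illustration
toy_df <- tibble(
  name      = c("A", "A", "A", "B"),
  locations = c("Anarkali Bazaar", "Anarkali Bazaar", "Karol Bagh", "Anarkali Bazaar"),
  city      = c("Lahore", "Lahore", NA, "Lahore"),
  country   = c("Pakistan", "Pakistan", "India", "Pakistan")
)

toy_distinct <- toy_df %>%
  # NA parts are dropped, so "Karol Bagh, India" has no dangling separator
  unite("address", locations:country, sep = ", ", na.rm = TRUE) %>%
  group_by(name) %>%
  # distinct() respects the grouping: duplicates are removed within each person only
  distinct(address, .keep_all = TRUE) %>%
  ungroup()

toy_distinct
```

Person A's repeated address collapses to one row, while the same address is kept for person B because the grouping is by name.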

Part 2: Geocoding

Find distinct addresses

Geocoding the addresses can be quite time consuming. To save time, we can run only the distinct addresses and then join these back to partition_distinct_locations afterwards.

#create a table of distinct addresses
addresses <- partition_distinct_locations %>%
  distinct(address)

Run geocoder

The script relies on the tidygeocoder package developed by Jesse Cambon, Diego Hernangómez, Christopher Belanger, and Daniel Possenriede. The package allows users to select the geocoder of their choice. For ease of reproducibility OpenStreetMap (osm) was selected, though other services that require registration or login might be more accurate. Because geocoding is time consuming, the call has been commented out and the geocoded address file has been cached.
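
The caching approach described above can be sketched as a small helper. This is an illustration under stated assumptions, not the code used for the essay: geocode_cached and fake_geocoder are hypothetical names, and the fake geocoder returns made-up coordinates so that the example runs offline.

```r
library(readr)
library(tibble)

# Hypothetical helper: reuse a cached CSV when present, otherwise call the geocoder.
# `geocode_fn` is any function that takes a character vector and returns a table
# with address, lat and long columns.
geocode_cached <- function(addresses, cache_path, geocode_fn) {
  if (file.exists(cache_path)) {
    return(read_csv(cache_path, show_col_types = FALSE))
  }
  result <- geocode_fn(addresses)
  write_csv(result, cache_path)
  result
}

# Offline stand-in for the geocoder, with made-up coordinates
fake_geocoder <- function(x) tibble(address = x, lat = 0, long = 0)

cache_file <- tempfile(fileext = ".csv")
first_run  <- geocode_cached("Lahore, Pakistan", cache_file, fake_geocoder)
# The second call never reaches its geocoder (here `stop`), because the cache exists
second_run <- geocode_cached("Lahore, Pakistan", cache_file, stop)
```

With real data, the geocoder argument could be function(x) geo(x, method = "osm", full_results = FALSE).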

#Because the processing time is quite lengthy, the geocoding call has been commented out. Remove the comment when running custom data.

#addresses_geocoded <- geo(addresses$address, method = 'osm', full_results = FALSE)

#Skip geocoding and read in the cached geocoded addresses
addresses_geocoded <- read_csv("data/addresses_geocoded.csv")

Join coordinates to partition_distinct_locations

The coordinates are joined to the existing dataframe partition_distinct_locations.

partition_geolocations <- partition_distinct_locations %>%
  left_join(addresses_geocoded, by = "address")
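
After a join of this kind, one way to list the addresses that still need manual attention is to filter for missing coordinates. The sketch below uses invented rows; the lat and long column names are assumed to match the geocoder output, and the coordinates are approximate.

```r
library(dplyr)

# Invented example rows; not taken from the archive
toy_locations <- tibble(
  name    = c("A", "A", "B"),
  address = c("Lahore, Pakistan", "Unknown Village", "Lahore, Pakistan")
)
toy_geocoded <- tibble(
  address = c("Lahore, Pakistan", "Unknown Village"),
  lat     = c(31.5, NA),   # approximate, for illustration only
  long    = c(74.3, NA)
)

# left_join keeps every narrative row, even where the geocoder found nothing
toy_joined <- toy_locations %>%
  left_join(toy_geocoded, by = "address")

# Addresses with no coordinates are candidates for manual geocoding
needs_review <- toy_joined %>%
  filter(is.na(lat) | is.na(long)) %>%
  distinct(address)
```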

Clean final table

The geocoder will not necessarily catch all locations. Some of the locations have to be geocoded and corrected manually. This process is involved, and has to be done through QGIS. Several additional fields were created to keep track of the changes:

  • known - Whether the location was ultimately found. FALSE indicates that the location is a best guess

  • camp - Indicates that this was a refugee camp. This data was not used

  • resolved_location - The final location name for the coordinate. There may be a discrepancy between this and the initial address

  • admin - Indicates whether this is a larger administrative area within which other locations fall. It also includes rivers. Admin areas are dropped from the analysis because they are redundant. Likewise, as the relevant position along a river is often unknowable, rivers were dropped too.

write_csv(partition_geolocations,
          "data/partition_geolocations_raw.csv")

Part 3: Statistical Overview

Import clean data

Read in the data file partition_geolocations_clean. This file is read-only to prevent accidental file corruption.

partition_clean <-
  read_csv("data/partition_geolocations_clean.csv", na = "NA")

Aggregate location totals

Generate a table for all analysis: only include non-administrative areas, unique locations for each person, counts per person, and total counts per location.

partition_statistics <- partition_clean %>%
  rename(latitude = 9, longitude = 10) %>%
  filter(admin == FALSE) %>%
  filter(occupation != "No") %>%
  mutate(PersonID = paste(name, "_", age)) %>%
  group_by(PersonID) %>%
  distinct() %>%
  add_count(PersonID, name = "loc_by_name") %>%
  ungroup() %>%
  add_count(resolved_location, name = "loc_total")
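
To make the two counting columns concrete, here is a minimal sketch on invented data. It also shows that, because the table holds one row per person and location, a mean taken over all rows weights each person by the number of locations they mention, whereas reducing to one row per person first gives a per-person mean. The toy values are assumptions for illustration.

```r
library(dplyr)

# Invented rows: person A mentions three locations, person B two
toy_clean <- tibble(
  PersonID          = c("A _ 10", "A _ 10", "A _ 10", "B _ 12", "B _ 12"),
  resolved_location = c("Lahore", "Amritsar", "Delhi", "Lahore", "Delhi")
)

toy_stats <- toy_clean %>%
  add_count(PersonID, name = "loc_by_name") %>%          # locations per person
  add_count(resolved_location, name = "loc_total")       # mentions per location

# Mean over all rows (each person counted once per location)
row_mean <- mean(toy_stats$loc_by_name)

# Mean over people (each person counted once)
person_mean <- toy_stats %>%
  distinct(PersonID, loc_by_name) %>%
  summarize(m = mean(loc_by_name)) %>%
  pull(m)
```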

General Overview

#Get number of unique locations

unique_locations <- partition_statistics %>%
  ungroup() %>%
  select(resolved_location) %>%
  distinct() %>%
  nrow()

#Get number of unique people
unique_people <- partition_statistics %>%
  ungroup() %>%
  select(PersonID) %>%
  distinct() %>%
  nrow()

#Calculate mean locations mentioned
mean_locations <- partition_statistics %>%
  ungroup() %>%
  summarize(mean_locations = mean(loc_by_name))

There are 768 unique locations in the data set. These are distributed across 320 people. On average, each person mentions 9.49 locations.

Locations by gender

Broken down by gender, the mean number of locations mentioned by men (9.88) is higher than that mentioned by women (8.67).

mean_locations_gender <- partition_statistics %>%
  group_by(gender) %>%
  summarize(mean_gender = round(mean(loc_by_name), 2))

mean_locations_gender %>%
  addHtmlTableStyle(col.rgroup = c("none", "#F5FBFF")) %>% 
  htmlTable(header = c("Gender", "Mean Locations Mentioned"))
   Gender   Mean Locations Mentioned
1  Female   8.67
2  Male     9.88

Locations by occupation

A similar trend emerges when accounting for occupation. Here, people who mention their occupation mention more locations.

mean_locations_occupation <- partition_statistics %>%
  group_by(occupation) %>%
  summarize(mean_occupation = round(mean(loc_by_name), 2))

mean_locations_occupation %>%
  addHtmlTableStyle(col.rgroup = c("none", "#F5FBFF")) %>% 
  htmlTable(header = c("Occupation", "Mean Locations Mentioned"))
   Occupation      Mean Locations Mentioned
1  Not Mentioned   8.06
2  Yes             9.92

Locations by occupation and gender

The contrast in the number of locations mentioned becomes even starker when the values are disaggregated by both gender and whether an occupation is mentioned.

mean_locations_occ_gen <- partition_statistics %>%
  group_by(gender, occupation) %>%
  summarize(mean_location = round(mean(loc_by_name), 2))

mean_locations_occ_gen %>%
  addHtmlTableStyle(col.rgroup = c("none", "#F5FBFF")) %>%
  htmlTable(header = c("Gender", "Occupation", "Mean Locations Mentioned"))
   Gender   Occupation      Mean Locations Mentioned
1  Female   Not Mentioned   8.24
2  Female   Yes             9.28
3  Male     Not Mentioned   7.19
4  Male     Yes             10.05

Percent mention of occupation by gender

Men mentioned their occupations far more often than women: 92% of men did so, compared with 38% of women.

partition_statistics %>% 
  group_by(gender) %>%
  select(PersonID, gender, occupation) %>% 
  distinct() %>% 
  count(occupation) %>% 
  mutate(percent = paste(round(n/sum(n),2)*100,"%")) %>% 
  addHtmlTableStyle(col.rgroup = c("none", "#F5FBFF")) %>% 
  htmlTable(header = c("Gender", "Occupation", "Number of People", "Percent" ))
   Gender   Occupation      Number of People   Percent
1  Female   Not Mentioned   70                 62 %
2  Female   Yes             43                 38 %
3  Male     Not Mentioned   17                 8 %
4  Male     Yes             192                92 %
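
The percentages in this table are shares within each gender. A minimal sketch of that calculation on invented data follows; the five people in toy_people are assumptions for illustration.

```r
library(dplyr)

# Invented one-row-per-person table
toy_people <- tibble(
  PersonID   = c("A", "B", "C", "D", "E"),
  gender     = c("Female", "Female", "Female", "Male", "Male"),
  occupation = c("Yes", "Not Mentioned", "Not Mentioned", "Yes", "Yes")
)

toy_share <- toy_people %>%
  group_by(gender) %>%
  count(occupation) %>%
  # the result is still grouped by gender, so sum(n) is the per-gender total
  mutate(share = n / sum(n)) %>%
  ungroup()
```

Because count() keeps the gender grouping, the shares sum to one within each gender rather than across the whole table.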

Distribution of locations mentioned

The distribution of locations mentioned shows that men who do not mention an occupation have a negligible impact on the mean number of locations mentioned. Meanwhile, the number of women who do not mention an occupation is quite substantial, and they tend to mention fewer locations. Even among those who mention their occupation, the men’s distribution has a longer tail.

partition_statistics %>%
  #one row per person so each person is counted once
  distinct(PersonID, gender, occupation, loc_by_name) %>%
  ggplot(aes(loc_by_name, fill = gender)) +
  geom_histogram(
    color = "black",
    alpha = .4,
    position = "identity"
  ) +
  scale_fill_brewer(palette = "Pastel2") +
  labs(title = "Histogram of Mentioned Locations by Occupation and Gender",
       x = "Number of Locations Mentioned",
       y = "Number of People",
       fill = "Gender") +
  facet_wrap(~ occupation) +
  theme_classic()
Figure 1: Locations Mentioned by Occupation and Gender


T-test and ANOVA score

#Generate t-scores

gender_ttest <- t.test(loc_by_name ~ gender, partition_statistics)
occupation_ttest <- t.test(loc_by_name ~ occupation, partition_statistics)

#Create dataframes of tscores
tscores <- map_df(list(gender_ttest, occupation_ttest), tidy)
tscores <- tscores[c("p.value")]

#Generate variables for ANOVA
gender_occupation <- partition_statistics %>%
  unite("gender_occupation", gender:occupation, remove = FALSE)
anova_gender_occupation <-
  aov(loc_by_name ~ gender_occupation, gender_occupation)

#Create dataframe for ANOVA score
anovascore <- map_df(list(anova_gender_occupation), tidy)
anovascore <- anovascore[c("statistic", "p.value")]

A Welch two-sample t-test on the difference in mean locations by gender (p = 6.1e-16) and on the difference in mean locations by occupation (p = 8.2e-34) affirms what visual inspection already suggests: that the difference in means is not random. At the same time, an analysis of variance (ANOVA) yields an F-statistic of 44.42 and a p-value of 5.7e-28, indicating that the variance between the group means is greater than the variance within the groups and is unlikely to be due to chance.
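
For readers unfamiliar with how these values are extracted, the following is a minimal sketch on synthetic data. The Poisson draws and the toy data frame are assumptions for illustration and will not reproduce the p-values reported above.

```r
library(broom)

# Synthetic data: 30 women and 30 men with made-up location counts
set.seed(1)
toy <- data.frame(
  gender      = rep(c("Female", "Male"), each = 30),
  occupation  = rep(c("Not Mentioned", "Yes"), times = 30),
  loc_by_name = c(rpois(30, 8), rpois(30, 10))
)
toy$gender_occupation <- paste(toy$gender, toy$occupation, sep = "_")

# t.test() defaults to the Welch test (unequal variances)
welch <- tidy(t.test(loc_by_name ~ gender, data = toy))

# One-way ANOVA across the four gender/occupation groups
anova_tab <- tidy(aov(loc_by_name ~ gender_occupation, data = toy))

welch$p.value
anova_tab[anova_tab$term == "gender_occupation", c("statistic", "p.value")]
```

tidy() turns each test object into a one-row-per-term table, which is why the p-value and F-statistic can be selected by column name.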

Part 4: Spatial Analysis

The spatial analysis of the data set was done with QGIS. As these manipulations are hard to document, only their result is shown. There were a number of cases where tidygeocoder did not catch a location. These had to be added manually.

Location diversity

Departure locations

#Departure locations
part_from <- partition_statistics %>%
  mutate(migrated_from = str_extract(migrated_from, "[^,]+")) %>%
  drop_na(migrated_from) %>%
  select(PersonID, migrated_from, gender)  %>%
  distinct(PersonID, migrated_from, gender) %>%
  add_count(migrated_from, name = "total_location") %>%
  group_by(gender) %>%
  add_count(migrated_from, name = "loc_gender") %>%
  add_count(gender, name = "gender_tot") %>%
  mutate(percent = loc_gender / gender_tot) %>%
  select(-PersonID,-loc_gender,-gender_tot) %>%
  distinct() %>%
  #ungroup() %>%
  arrange(desc(total_location), gender) %>%
  top_n(5, total_location) %>%
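  #the second argument of scales::percent() is accuracy, so values are rounded to the nearest 2%
  mutate(percent = percent(percent, accuracy = 2)) %>%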
  select(-total_location)

#Number of departure locations by gender
gender_migration <- partition_statistics %>%
  drop_na(migrated_from) %>%
  distinct(migrated_from, gender) %>%
  group_by(gender) %>%
  count(gender)
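
One detail of the formatting step is worth illustrating: in scales::percent() the second positional argument is accuracy, the increment that values are rounded to, not a number of digits. The sketch below uses an arbitrary value of 0.287 chosen only for illustration.

```r
library(scales)

share <- 0.287  # arbitrary example value

nearest_two <- percent(share, accuracy = 2)     # rounds to the nearest 2%
nearest_one <- percent(share, accuracy = 1)     # rounds to the nearest 1%
two_decimal <- percent(share, accuracy = 0.01)  # keeps two decimal places

c(nearest_two, nearest_one, two_decimal)
```

This is why the percentages in the departure and transit tables are all even numbers.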

The first thing that is notable about the departure locations is their diversity. While a plurality of people departed from Lahore (30% of women and 22% of men), followed by Rawalpindi (12% of women and 8% of men), there were many who departed from quite different locations. In fact, women departed from 43 different locations, while men departed from 89.

We can observe this diversity of points of departure by looking at a spatial representation of the direct lines of travel to Delhi and noting the diversity of points of origin.

Figure 2: Departure locations during Partition

    Migrated From      Gender   Percent Departure by Gender
1   Lahore             Female   30%
2   Lahore             Male     22%
3   Rawalpindi         Female   12%
4   Rawalpindi         Male     8%
5   Multan             Female   4%
6   Multan             Male     4%
7   Faisalabad         Female   6%
8   Faisalabad         Male     4%
9   Dera Ismail Khan   Female   6%
10  Dera Ismail Khan   Male     2%
Table 3: Top 5 departure locations by gender

Transit Locations

partition_transfer <- partition_statistics %>%
  #Filter out Delhi as a final location
  filter(resolved_location != "Delhi") %>%
  
  #Clean up the migrated from and migrated to data
  mutate(migrated_from = str_extract(migrated_from, "[^,]+")) %>%
  mutate(migrated_to = str_extract(migrated_to, "[^,]+")) %>%
  
  #Remove all cases where the migrated from location is the same as one of the transit locations
  filter(migrated_from != resolved_location) %>%
  
  #Remove all cases where the resolved location equals migrated to.
  filter(migrated_to != resolved_location) %>%
  
  #Get the number of transfer locations based on where people migrated from and their gender
  group_by(gender) %>%
  mutate(total_gender = n_distinct(PersonID)) %>%
  group_by(migrated_from, gender) %>%
  add_count(resolved_location, name = "migration_location", sort =
              TRUE) %>%
  #Calculate the percentage as a share of all respondents of that gender
  mutate(percent_transit = migration_location / total_gender) %>%
  
  #Clean up table for presentation
  select(migrated_from,
         gender,
         resolved_location,
         migration_location,
         percent_transit) %>%
  distinct(migrated_from, resolved_location, percent_transit) %>%
  arrange(desc(percent_transit), resolved_location) %>%
  ungroup() %>%
  top_n(10, percent_transit) %>%
  mutate(percent_transit = percent(percent_transit, accuracy = 2)) %>%
  relocate(gender, .before = migrated_from)

Likewise, the transit locations were also quite diverse. Amritsar occurs most frequently for both women (10%) and men (8%), but it does not account for a majority of transit locations.

#Generate table
partition_transfer %>%
  addHtmlTableStyle(col.rgroup = c("none", "#F5FBFF")) %>%
  htmlTable(header = c(
    "Gender",
    "Migrated From",
    "Transfer",
    "Percent of Respondents Transferred"
  ))
    Gender   Migrated From   Transfer          Percent of Respondents Transferred
1   Female   Lahore          Amritsar          10%
2   Male     Lahore          Amritsar          8%
3   Female   Rawalpindi      Lahore            6%
4   Female   Lahore          Rawalpindi        6%
5   Female   Lahore          Anarkali Bazaar   4%
6   Female   Lahore          Shimla            4%
7   Male     Lahore          Rawalpindi        4%
8   Female   Faisalabad      Amritsar          4%
9   Female   Rawalpindi      Karol Bagh        4%
10  Female   Lahore          Mumbai            4%
11  Female   Lahore          Mussoorie         4%

The spatial analysis requires several manipulations of the data that were done in QGIS. What follows is a brief outline.

Create from Locations

Note: the geocoding process is skipped for the purposes of this notebook

  • Subset the data into from hubs.
hub_from <- partition_statistics %>% 
            select(migrated_from) %>% 
            drop_na() %>% 
            filter(migrated_from!="TBA") %>% 
            distinct()
  • Geocode each from hub.
#hub_from_geo <- geo(hub_from$migrated_from, method = 'osm', full_results = FALSE)
  • Attach data back to from_hubs.
hub_from_join <- hub_from_geo %>%
  rename(migrated_from = address)

hub_from_join <- partition_statistics %>%
  left_join(hub_from_join)

hub_from_join <- hub_from_join %>%
  select(name, age, migrated_from, gender, occupation, lat, long) %>%
  filter(migrated_from != "TBA") %>%
  drop_na(migrated_from) %>%
  distinct()
  • Write from_hub file for geoprocessing.
write_csv(hub_from_join,"data/from_hubs.csv")
  • OSM will not necessarily catch all locations. Some of these have to be hand coded.
hub_from_join_clean <- read_csv("data/from_hubs_clean.csv")
  • Create line geometry for departure locations to Delhi.
hub_from_join_clean <- hub_from_join_clean %>%
  #WKT uses longitude latitude order; the fixed end point is Delhi
  mutate(WKT = paste0("LINESTRING(", long, " ", lat, ", 77.2219388 28.6517178)"))
write_csv(hub_from_join_clean, "data/hubs_to_delhi.csv")
  • Measure distance from from_hub to Pakistan border using the NNjoin plugin for QGIS.
distance_to_border <- read_csv("data/distance_to_border.csv")
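
The line geometries written in the steps above are plain WKT strings. As a minimal sketch, the helper below builds one such string in base R; spoke_wkt is a hypothetical name, and the Lahore coordinates are approximate values used only for illustration.

```r
# Delhi end point as used in this document (longitude, latitude)
delhi_long <- 77.2219388
delhi_lat  <- 28.6517178

# Hypothetical helper: WKT expects "x y" order, i.e. longitude before latitude
spoke_wkt <- function(long, lat) {
  sprintf("LINESTRING(%s %s, %s %s)", long, lat, delhi_long, delhi_lat)
}

# Approximate coordinates for Lahore, for illustration only
lahore_line <- spoke_wkt(74.3587, 31.5204)
lahore_line
```

A CSV with such a WKT column can be loaded into QGIS as a delimited text layer with the geometry definition set to well-known text.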

Evaluating distance to border

#Get the mean and median distance by gender
group_mean <- distance_to_border %>%
  group_by(gender) %>%
  summarise(grp_mean = mean(distance_km),
            group_median = median(distance_km))

#Get the share of people who travelled more than 100km
#(the mean of a logical vector is the proportion of TRUE values)
more_than_100 <- distance_to_border %>%
  summarise(more_than = mean(distance_km > 100))

The path of travel to the border was quite long for the majority of interviewees. Men and women both travelled more than 100km on average, and the median distance also exceeded 100km (women = 105km, men = 128km). Even though 100km is a rather arbitrary threshold, the majority of people (59%) travelled more than that distance to get to the border. The sense that the majority of interviewees travelled from quite far to even reach the border is also borne out in the distribution of distances travelled.

distance_to_border %>%
  group_by(gender) %>%
  ggplot(aes(distance_km, fill = gender)) +
  geom_histogram(
    color = "black",
    alpha = .4,
    position = "identity"
  ) +
  scale_color_brewer(palette = "Pastel2") +
  scale_fill_brewer(palette = "Pastel2") +
  labs(title = "Histogram of Distance to Border by Gender",
       x = "Distance in km",
       y = "Count",
       fill = "Gender") +
  facet_wrap(~ gender) +
  theme_classic() +
  geom_vline(data = group_mean,
             aes(xintercept = grp_mean, color = gender),
             linetype = "dashed") +
  theme(legend.position = "none") +
  geom_text(data = group_mean,
            aes(
              x = grp_mean,
              y = 0,
              label = paste("Mean Distance = ", round(grp_mean, 0), "km"),
              hjust = -.05,
              vjust = -22
            ))
Figure 3: Distribution of Distance to Border


Analyzing Hub and Spokes Model

Using QGIS, it is possible to take all of the locations in each narrative and attach them to a central hub, in this case Delhi.

to_hubs <- partition_statistics %>%
  filter(resolved_location != "Delhi") %>%
  #Build a WKT line from each location to Delhi (longitude latitude order)
  mutate(WKT = paste0("LINESTRING(", longitude, " ", latitude, ", 77.2219388 28.6517178)")) %>%
  select(name, age, gender, occupation, resolved_location, migrated_from,
         migrated_to, PersonID, loc_by_name, loc_total, WKT)
write_csv(to_hubs, "data/hub_and_spoke.csv")

Figure 4: Locations Mentioned in the Interviews